Explore the advanced capabilities of Anthropic's Claude Opus 4.7, focusing on agentic coding, high-resolution vision, and long-horizon autonomous tasks. Understand how these features enable more sophisticated AI systems for real-world applications.
MiniMax has released MMX-CLI, a Node.js-based command-line interface that gives both developers and AI agents native access to multimodal AI capabilities, including image, video, speech, music, vision, and search.
This explainer introduces VimRAG, a new AI system from Alibaba that uses a memory graph to improve how models understand both text and images. It explains how the technology works and why it matters for future AI applications.
Alibaba's Qwen team has released Qwen3.5 Omni, a native multimodal model that processes text, audio, and video and supports real-time interaction. Positioned as a competitor to Google's Gemini 3.1 Pro, the model marks a significant step forward in multimodal AI architecture.
Explore the advanced technical features of Google's Gemini 3.1 Flash Live, a real-time multimodal voice model designed for low-latency audio and video interactions.
Researchers are exploring how to build vision-guided web AI agents using the MolmoWeb-4B model, which interprets screenshots to navigate and interact with websites without relying on HTML parsing.
Finance leaders are using multimodal AI to automate complex workflows, moving past the limits of traditional document processing. These frameworks are changing how financial institutions handle unstructured data.
Learn how Mistral AI's new Mistral Small 4 unifies instruction following, reasoning, and multimodal capabilities in a single model built on a Mixture-of-Experts architecture.
Learn how GLM-OCR, a new AI model from Zhipu AI, converts complex documents into structured data by reading text, understanding layout, and extracting key information.
Google introduces Gemini Embedding 2, a multimodal embedding model that unifies text, images, video, audio, and documents in a single vector space, streamlining AI development and improving performance.
OpenAI employees and leaked project details suggest the company is developing a new omni model, a multimodal AI system that could unify text, audio, and image processing.
This article explains how Meta and NYU researchers are exploring unlabeled video data as a new training frontier for multimodal AI models, challenging conventional assumptions about model architecture and data prioritization.